{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QfvWIKUf0V5V"
   },
   "source": [
    "# DrEvalPy Demo\n",
    "You can execute the DrEval Framework either via Nextflow as nf-core pipeline or as Python standalone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install drevalpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let us see which dataset and models are already implemented in drevalpy.\n",
    "You can test your own model on all the datasets and comapre your model to all.the implemented ones:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from drevalpy.models import MODEL_FACTORY\n",
    "from drevalpy.datasets import AVAILABLE_DATASETS\n",
    "print(f\"Models: {list(MODEL_FACTORY.keys())}\")\n",
    "print(f\"Dataset: {list(AVAILABLE_DATASETS.keys())}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# let us first train a model on the toy dataset. It will download the dataset for you.\n",
    "from drevalpy.experiment import drug_response_experiment\n",
    "\n",
    "naive_mean = MODEL_FACTORY[\"NaivePredictor\"] # a naive model that just predicts the training mean\n",
    "enet = MODEL_FACTORY[\"ElasticNet\"] # An Elastic Net based on drug fingerprints and gene expression of 1000 landmark genes\n",
    "simple_nn = MODEL_FACTORY[\"SimpleNeuralNetwork\"] # A neural network based on drug fingerprints and gene expression of 1000 landmark genes\n",
    "\n",
    "toyv2 = AVAILABLE_DATASETS[\"TOYv1\"](path_data=\"data\")\n",
    "\n",
    "drug_response_experiment(\n",
    "            models=[enet, simple_nn],\n",
    "            baselines=[naive_mean], # Ablation studies and robustness tests are not done for baselines.\n",
    "            response_data=toyv2,\n",
    "            n_cv_splits=2, # the number of cross validation splits. Should be higher in practice :)\n",
    "            test_mode=\"LCO\", # LCO means Leave-Cell-Line out. This means that the test and validation splits only contain unseed cell lines.\n",
    "            run_id=\"my_first_run\",\n",
    "            path_data=\"data\", # where the downloaded drug response and feature data is stored\n",
    "            path_out=\"results\", # results are stored here :)\n",
    "            hyperparameter_tuning=False) # if True (default), hyperparameters of the models and baselines are tuned.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "os.listdir(\"results/my_first_run/TOYv1/LCO\")\n",
    "# the results folder holds splits and the results for all models. Lets look at the predictions of the simple neural network for the 0'th fold:\n",
    "pd.read_csv(\"results/my_first_run/TOYv1/LCO/SimpleNeuralNetwork/predictions/predictions_split_0.csv\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# you can generate your own evaluations from these predictions.\n",
    "# However, we recommend using our evaluation pipeline, which calculates meaningful metrics, creates figures and prepares an HTML report:\n",
    "from drevalpy.visualization.create_report import create_report\n",
    "create_report(run_id=\"my_first_run\", dataset=\"TOYv1\")\n",
    "\n",
    "# this will create a report in the results/my_first_run/index.html which you can open in your browser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We prefer running this in the console:\n",
    "\n",
    "!drevalpy --models RandomForest --dataset_name TOYv1 --n_cv_splits 2 --test_mode LPO --run_id my_second_run --no_hyperparameter_tuning\n",
    "!drevalpy-report --run_id my_second_run --dataset TOYv1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qWbDZA4X17Tj"
   },
   "source": [
    "## Using the drevalpy nextflow pipeline for highly optimized runs:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7145560s6U-K"
   },
   "source": [
    "You should use DrEval with Nextflow on high-performance clusters or clouds. Nextflow supports various systems like Slurm, AWS, Azure, Kubernetes, or SGE. On a local machine, you can also use the pipeline but probably, the overhang from spawning processes is not worth it so you might prefer the standalone. Nextflow needs a java version >=17, so we need to install that, too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install nextflow\n",
    "!apt-get install openjdk-17-jre-headless -qq > /dev/null\n",
    "!java --version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we need a demo config for nextflow because on colab, we only have two CPUs available:\n",
    "with open('demo.config', 'w') as f:\n",
    "  f.write('process {\\n')\n",
    "  f.write('\\tresourceLimits = [\\n')\n",
    "  f.write('\\t\\tcpus: 2,\\n')\n",
    "  f.write('\\t\\tmemory: \"3.GB\",\\n')\n",
    "  f.write('\\t\\ttime: \"1.h\",\\n')\n",
    "  f.write('\\t]\\n')\n",
    "  f.write('}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wvc80Ahz6jj4"
   },
   "source": [
    "We run the pipeline with the TOYv1 dataset which was subset from CTRPv2. For the demo, we don't do hyperparameter tuning and we just do 2 CV splits. We want to inspect the final model which is why we train a final model on the full dataset. This should take about 10 minutes.\n",
    "If you were on a compute cluster, you could now decide if you want to run the pipeline inside conda, docker, singularity, ... via the -profile option (-profile singularity, e.g.). If you want the executor to be slurm/..., you can write this in your config. You can find plenty of config examples online, e.g., the one for our group: [daisybio](https://github.com/nf-core/configs/blob/master/conf/daisybio.config)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nextflow run nf-core/drugresponseeval -r dev -c demo.config --dataset_name TOYv1 --models ElasticNet --baselines NaiveMeanEffectsPredictor --n_cv_splits 2 --no_hyperparameter_tuning --final_model_on_full_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hZ6AcGPDA-yo"
   },
   "source": [
    "The results will be stored in `results/my_run`. You can inspect pipeline information like runtime or memory in `results/pipeline_info`. In `my_run/report`, you can find the html report where you can look at your results interactively. The underlying data is in `my_run/evaluation_results.csv` or `true_vs_pred.csv`.\n",
    "\n",
    "We now inspect the final model saved in `results/my_run/LCO/ElasticNet/final_model` with `drevalpy` functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from drevalpy.models import MODEL_FACTORY\n",
    "enet_class = MODEL_FACTORY[\"ElasticNet\"]\n",
    "enet = enet_class.load(\"results/my_run/LCO/ElasticNet/final_model\")\n",
    "enet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WT4P7OBsDgWq"
   },
   "source": [
    "We now want to extract the top scoring features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the top features\n",
    "cell_line_input = enet.load_cell_line_features(data_path=\"data\", dataset_name=\"TOYv1\")\n",
    "drug_input = enet.load_drug_features(data_path=\"data\", dataset_name=\"TOYv1\")\n",
    "all_features = list(cell_line_input.meta_info['gene_expression'])+[f'fingerprint_{i}' for i in range(128)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame({'feature': all_features, 'coef': enet.model.coef_})\n",
    "df.sort_values(by=\"coef\", ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Top 50 features:\")\n",
    "list(df.sort_values(by=\"coef\", ascending=False)[\"feature\"][:50])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yErEV4pCK6-B"
   },
   "source": [
    "The fingerprints are the most important features as the drug identity is responsible for the most variation between responses."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
